IntoNews: Online news retrieval using closed captions

نویسندگان

  • Roi Blanco
  • Gianmarco De Francisci Morales
  • Fabrizio Silvestri
چکیده

We present IntoNews, a system to match online news articles with spoken news from a television newscasts represented by closed captions. We formalize the news matching problem as two independent tasks: closed captions segmentation and news retrieval. The system segments closed captions by using a windowing scheme: sliding or tumbling window. Next, it uses each segment to build a query by extracting representative terms. The query is used to retrieve previously indexed news articles from a search engine. To detect when a new article should be surfaced, the system compares the set of retrieved articles with the previously retrieved one. The intuition is that if the di↵erence between these sets is large enough, it is likely that the topic of the newscast currently on air has changed and a new article should be displayed to the user. In order to evaluate IntoNews, we build a test collection using data coming from a second screen application and a major online news aggregator. The dataset is manually segmented and annotated by expert assessors, and used as our ground truth. It is freely available for download through the Webscope program. Our evaluation is based on a set of novel time-relevance metrics that take into account three di↵erent aspects of the problem at hand: precision, timeliness and coverage. We compare our algorithms against the best method previously proposed in literature for this problem. Experiments show the trade-o↵s involved among precision, timeliness and coverage of the airing news. Our best method is four times more accurate than the baseline. Email addresses: [email protected] (Roi Blanco), [email protected] (Gianmarco De Francisci Morales), [email protected] (Fabrizio Silvestri) Preprint submitted to Elsevier November 29, 2013

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automated closed-captioning using text alignment

The production of closed captions is an important but expensive process in video broadcasting. We propose a method to generate highly accurate off-line captions efficiently. Our system uses text alignment to synchronize program transcripts obtained for a video program with text produced by an automatic speech recognition (ASR) system. We will also describe the accuracy in both closed-caption te...

متن کامل

Integrating multi-modal content analysis and hyperbolic visualization for large-scale news video retrieval and exploration

In this paper, we have developed a novel scheme to achieve more effective analysis, retrieval and exploration of large-scale news video collections by performing multi-modal video content analysis and synchronization. First, automatic keyword extraction is performed on news closed captions and audio channels to detect the most interesting news topics (i.e., keywords for news topic interpretatio...

متن کامل

Story Segmentation and Detection of Commercials in Broadcast News Video

The Informedia Digital Library Project [Wactlar96] allows full content indexing and retrieval of text, audio and video material. Segmentation is an integral process in the Informedia digital video library. The success of the Informedia project hinges on two critical assumptions: that we can extract sufficiently accurate speech recognition transcripts from the broadcast audio and that we can seg...

متن کامل

Using Text Surrounding Method to Enhance Retrieval of Online Images by Google Search Engine

Purpose: the current research aimed to compare the effectiveness of various tags and codes for retrieving images from the Google. Design/methodology: selected images with different characteristics in a registered domain were carefully studied. The exception was that special conceptual features have been apportioned for each group of images separately. In this regard, each group image surr...

متن کامل

Laughter extracted from television closed captions as speech recognizer training data

Closed captions in television broadcasts, intended to aid the hearing impaired, also have potential as training data for speech-recognition software. Use of closed captions for automatic extraction of virtually unlimited training data has already been demonstrated [1]. This paper reports some preliminary work on the use of non-speech sound tokens included in closed captions to extract training ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Inf. Process. Manage.

دوره 51  شماره 

صفحات  -

تاریخ انتشار 2015